gguf.md: Add GGUF Naming Convention Section #822
Conversation
Interesting! We could probably parse it on the HF side in the future if it makes sense and if it unlocks cool features. (We already attempt to extract the quantization type from the filename, but this could make it more robust. cc @mishig25)
If it helps, we follow a somewhat similar (but not exhaustive) convention. Standardisation in file names is always a great move!
Had a quick look. Do you mean your current naming arrangement is kind of like
If so, do you have a preferred form? I came to this form basically by casual observation of typical naming schemes on Hugging Face, hence this proposal. But obviously it's research by vibes, so it would be better if I had some feedback, especially from those who would be forced to try and parse such files. Ergo @JidongZhang-THU, would it make it easier for you if we made 'version' not optional? (Expert count being optional is okay, as it's easy to tell whether the x is there or not.) And of course... did I miss anything that would be useful for people parsing model file names?
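For illustration, here is a minimal parsing sketch of the proposed form (hypothetical Python, not anything llama.cpp ships; it assumes version and expert count are optional and that parameter counts end in K/M/B/T):

```python
import re

# Hypothetical pattern for <Model>-<Version>-<ExpertsCount>x<Parameters>-<EncodingScheme>.gguf
# with <Version> and <ExpertsCount>x treated as optional components.
GGUF_NAME = re.compile(
    r"^(?P<model>[A-Za-z0-9_.]+(?:-[A-Za-z0-9_.]+)*?)"  # model name, may contain dashes
    r"(?:-(?P<version>v\d+(?:\.\d+)*))?"                # optional version, e.g. v0.1
    r"-(?:(?P<experts>\d+)x)?"                          # optional expert count, e.g. 8x
    r"(?P<parameters>\d+(?:\.\d+)?[KMBT])"              # parameter count, e.g. 7B
    r"-(?P<encoding>[A-Za-z0-9_]+)"                     # encoding scheme, e.g. Q4_0
    r"\.gguf$"
)

def parse_gguf_filename(name):
    m = GGUF_NAME.match(name)
    return m.groupdict() if m else None

print(parse_gguf_filename("Mixtral-v0.1-8x7B-Q4_0.gguf"))
# -> {'model': 'Mixtral', 'version': 'v0.1', 'experts': '8',
#     'parameters': '7B', 'encoding': 'Q4_0'}
print(parse_gguf_filename("Hermes-2-Pro-Llama-3-8B-F16.gguf"))
# -> version and experts come back as None
```

If 'version' were made non-optional, the version group would simply lose its trailing `?`, which is one less ambiguous branch for every parser to worry about.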
Let me add a bit of backstory as to why we chose this naming scheme (which I'm more than happy to change): a typical user of the quantisation space would want to create quants for an arbitrary model on the Hub, and the model name would typically already carry the information about expert count and parameters. I'm open to ideas to align better; I just thought I'd provide more context.
@julien-c do you have a preference when it comes to parsing filenames? I'm basically treating it as a sort of
@mofosyne no, we'll adapt!
Thanks for the historical context. I might have gone a bit crazy here, but I've ended up mapping each enum name to its tensor type description and the historical context behind each PR that relates to its initial inclusion... Not even sure if it's allowed on this gguf.md page, so I'm just attaching it to this comment in case I should remove it. But hopefully it helps provide a general glance at each GGUF file type. Oh, and I've updated the page a bit. I opted for 'tensor type' rather than 'file name', as that appears to make more sense, to me at least.
@mofosyne, I've made a similar table of quant descriptions at https://huggingface.co/docs/hub/gguf#quantization-types (sharing just in case there's any useful info).
@mishig25 thanks. I decided to cross-reference your table with what I have, and this is the breakdown I was able to figure out. I'm not 100% sure on all the superblock configurations for the i-quantization, based on your statement and the llama.cpp PR descriptions, but I was able to extract some. I think gg would be a clearer source of truth here (especially for some of my general assertions below).
Encoding Scheme Name Table
@mishig25 when you made the table, were you able to figure out the superblock makeup and how to represent the weight formulae (in general)? Also, in your opinion, is this table in the right location, or should it be split up (and if so, where)? (And on a meta note... how much information should we really expose in this document? Too much can confuse developers.) Edit: Justine T also mentioned, regarding my 'Weights Encoding Scheme' table, that I may have issues using different names for the quants than what the software (presumably llama.cpp) uses. So I guess we could say this is not a super hard-and-fast mapping and can include other variants... but for the context of ggml this is the base scheme name. llama.cpp can then define its own extra naming (e.g. _S, _M and _L) in its own documentation (as extra pointers for users of what to expect).
IMO, the entire encoding section should just be reduced to simply:
The rest of the information is specific mainly to llama.cpp.
@ggerganov thanks, it looks much more compact and focused now. @mishig25 @julien-c @Vaibhavs10 @Green-Sky shall we lock this in? (I wonder if the table with bits, datatype, block config, etc. would be useful anywhere, such as the llama.cpp documentation, and if so, in which specific location.)
@ggerganov thanks for the merge. I've decided to place the table at https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes . It turns out the GitHub wiki kind of sucks at rendering tables, but I hope it's of help to everyone here.
@@ -18,6 +18,43 @@ GGUF is a format based on the existing GGJT, but makes a few changes to the form

The key difference between GGJT and GGUF is the use of a key-value structure for the hyperparameters (now referred to as metadata), rather than a list of untyped values. This allows for new metadata to be added without breaking compatibility with existing models, and to annotate the model with additional information that may be useful for inference or for identifying the model.

### GGUF Naming Convention

GGUF follows a naming convention of `<Model>-<Version>-<ExpertsCount>x<Parameters>-<EncodingScheme>.gguf`
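As an aside on the key-value point above, here is a minimal sketch of walking GGUF metadata with the `gguf` Python package that ships with llama.cpp (the `GGUFReader` API and field layout are assumptions here, so treat this as illustrative):

```python
# Minimal sketch, assuming llama.cpp's gguf-py package is installed
# (pip install gguf) and exposes GGUFReader as below.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # hypothetical model path

# Every piece of metadata is a typed key-value field; readers can skip
# keys they do not recognise, which is what keeps GGUF forward-compatible.
for field in reader.fields.values():
    print(field.name, [t.name for t in field.types])
```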
@mofosyne great work!
Maybe the only missing information is the optional suffix with the shard info.
Example: "grok-1/grok-1-q4_0-00003-of-00009.gguf"
#820
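Following on from that, a minimal sketch of splitting off such a shard suffix (hypothetical Python; the five-digit zero-padded `-%05d-of-%05d` shape is assumed from the example above):

```python
import re

# Hypothetical: strip an optional "-00003-of-00009" style shard suffix.
SHARD_SUFFIX = re.compile(r"-(?P<shard>\d{5})-of-(?P<count>\d{5})\.gguf$")

def split_shard(name):
    m = SHARD_SUFFIX.search(name)
    if m is None:
        return name, None, None          # not sharded
    base = name[: m.start()] + ".gguf"   # base name without shard info
    return base, int(m["shard"]), int(m["count"])

print(split_shard("grok-1-q4_0-00003-of-00009.gguf"))
# -> ('grok-1-q4_0.gguf', 3, 9)
```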
This PR is based on the outfile default-name generation in ggerganov/llama.cpp#4858. The text was copied from there, but the historical references and the justification for why it was designed that way have been removed.
Feedback and adjustments are appreciated. Any changes to this will also mean updating llama.cpp's default name generation to match.
In addition, is there any filename generation in this repo? If so, we may want to update it to use this common naming scheme as well.